Gwen claims certain attributes are not relevant. See the spreadsheet she attached on Moodle. These attributes WILL NOT be filled in.

Data Pre-Processing

library(corrplot)
corrplot 0.92 loaded
library(ggplot2)
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.3.2 ──✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
✔ purrr   0.3.4      ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(fastDummies)
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Get rid of old stuff

rm(list=ls())
par(mfrow=c(1,1))
# simple R program to read csv file using read.table()
raw_data <- read.csv2("./LCdata.csv", header = TRUE, row.names=NULL, sep=";")


head(raw_data)
summary(raw_data)
       id             member_id          loan_amnt      funded_amnt    funded_amnt_inv   
 Min.   :   54734   Min.   :   70473   Min.   :  500   Min.   :  500   Length:798641     
 1st Qu.: 9207230   1st Qu.:10877939   1st Qu.: 8000   1st Qu.: 8000   Class :character  
 Median :34433372   Median :37095300   Median :13000   Median :13000   Mode  :character  
 Mean   :32463636   Mean   :35000265   Mean   :14754   Mean   :14741                     
 3rd Qu.:54900100   3rd Qu.:58470266   3rd Qu.:20000   3rd Qu.:20000                     
 Max.   :68617057   Max.   :73544841   Max.   :35000   Max.   :35000                     
                                                                                         
     term             int_rate         installment         emp_title          emp_length       
 Length:798641      Length:798641      Length:798641      Length:798641      Length:798641     
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                               
                                                                                               
                                                                                               
                                                                                               
 home_ownership      annual_inc        verification_status   issue_d         
 Length:798641      Length:798641      Length:798641       Length:798641     
 Class :character   Class :character   Class :character    Class :character  
 Mode  :character   Mode  :character   Mode  :character    Mode  :character  
                                                                             
                                                                             
                                                                             
                                                                             
 loan_status         pymnt_plan            url                desc             purpose         
 Length:798641      Length:798641      Length:798641      Length:798641      Length:798641     
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                               
                                                                                               
                                                                                               
                                                                                               
    title             zip_code          addr_state            dti             delinq_2yrs     
 Length:798641      Length:798641      Length:798641      Length:798641      Min.   : 0.0000  
 Class :character   Class :character   Class :character   Class :character   1st Qu.: 0.0000  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median : 0.0000  
                                                                             Mean   : 0.3145  
                                                                             3rd Qu.: 0.0000  
                                                                             Max.   :39.0000  
                                                                             NA's   :25       
 earliest_cr_line   inq_last_6mths    mths_since_last_delinq mths_since_last_record
 Length:798641      Min.   : 0.0000   Min.   :  0.0          Min.   :  0.0         
 Class :character   1st Qu.: 0.0000   1st Qu.: 15.0          1st Qu.: 51.0         
 Mode  :character   Median : 0.0000   Median : 31.0          Median : 70.0         
                    Mean   : 0.6947   Mean   : 34.1          Mean   : 70.1         
                    3rd Qu.: 1.0000   3rd Qu.: 50.0          3rd Qu.: 92.0         
                    Max.   :33.0000   Max.   :188.0          Max.   :129.0         
                    NA's   :25        NA's   :408818         NA's   :675190        
    open_acc        pub_rec          revol_bal        revol_util          total_acc     
 Min.   : 0.00   Min.   : 0.0000   Min.   :      0   Length:798641      Min.   :  1.00  
 1st Qu.: 8.00   1st Qu.: 0.0000   1st Qu.:   6443   Class :character   1st Qu.: 17.00  
 Median :11.00   Median : 0.0000   Median :  11876   Mode  :character   Median : 24.00  
 Mean   :11.55   Mean   : 0.1953   Mean   :  16930                      Mean   : 25.27  
 3rd Qu.:14.00   3rd Qu.: 0.0000   3rd Qu.:  20839                      3rd Qu.: 32.00  
 Max.   :90.00   Max.   :63.0000   Max.   :2904836                      Max.   :169.00  
 NA's   :25      NA's   :25        NA's   :2                            NA's   :25      
 initial_list_status  out_prncp         out_prncp_inv      total_pymnt       
 Length:798641       Length:798641      Length:798641      Length:798641     
 Class :character    Class :character   Class :character   Class :character  
 Mode  :character    Mode  :character   Mode  :character   Mode  :character  
                                                                             
                                                                             
                                                                             
                                                                             
 total_pymnt_inv    total_rec_prncp    total_rec_int      total_rec_late_fee  recoveries       
 Length:798641      Length:798641      Length:798641      Length:798641      Length:798641     
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                               
                                                                                               
                                                                                               
                                                                                               
 collection_recovery_fee last_pymnt_d       last_pymnt_amnt    next_pymnt_d      
 Length:798641           Length:798641      Length:798641      Length:798641     
 Class :character        Class :character   Class :character   Class :character  
 Mode  :character        Mode  :character   Mode  :character   Mode  :character  
                                                                                 
                                                                                 
                                                                                 
                                                                                 
 last_credit_pull_d collections_12_mths_ex_med mths_since_last_major_derog  policy_code
 Length:798641      Min.   : 0.00000           Min.   :  0.0               Min.   :1   
 Class :character   1st Qu.: 0.00000           1st Qu.: 27.0               1st Qu.:1   
 Mode  :character   Median : 0.00000           Median : 44.0               Median :1   
                    Mean   : 0.01447           Mean   : 44.1               Mean   :1   
                    3rd Qu.: 0.00000           3rd Qu.: 61.0               3rd Qu.:1   
                    Max.   :20.00000           Max.   :188.0               Max.   :1   
                    NA's   :126                NA's   :599107                          
 application_type   annual_inc_joint    dti_joint         verification_status_joint
 Length:798641      Length:798641      Length:798641      Length:798641            
 Class :character   Class :character   Class :character   Class :character         
 Mode  :character   Mode  :character   Mode  :character   Mode  :character         
                                                                                   
                                                                                   
                                                                                   
                                                                                   
 acc_now_delinq       tot_coll_amt      tot_cur_bal       open_acc_6m       open_il_6m    
 Min.   : 0.000000   Min.   :      0   Min.   :      0   Min.   : 0.0     Min.   : 0.0    
 1st Qu.: 0.000000   1st Qu.:      0   1st Qu.:  29861   1st Qu.: 0.0     1st Qu.: 1.0    
 Median : 0.000000   Median :      0   Median :  80647   Median : 1.0     Median : 2.0    
 Mean   : 0.005026   Mean   :    228   Mean   : 139508   Mean   : 1.1     Mean   : 2.9    
 3rd Qu.: 0.000000   3rd Qu.:      0   3rd Qu.: 208229   3rd Qu.: 2.0     3rd Qu.: 4.0    
 Max.   :14.000000   Max.   :9152545   Max.   :8000078   Max.   :14.0     Max.   :33.0    
 NA's   :25          NA's   :63276     NA's   :63276     NA's   :779525   NA's   :779525  
  open_il_12m      open_il_24m     mths_since_rcnt_il  total_bal_il      il_util         
 Min.   : 0.0     Min.   : 0.0     Min.   :  0.0      Min.   :     0   Length:798641     
 1st Qu.: 0.0     1st Qu.: 0.0     1st Qu.:  6.0      1st Qu.: 10164   Class :character  
 Median : 0.0     Median : 1.0     Median : 12.0      Median : 24545   Mode  :character  
 Mean   : 0.8     Mean   : 1.7     Mean   : 21.1      Mean   : 36429                     
 3rd Qu.: 1.0     3rd Qu.: 2.0     3rd Qu.: 23.0      3rd Qu.: 47640                     
 Max.   :12.0     Max.   :19.0     Max.   :363.0      Max.   :878459                     
 NA's   :779525   NA's   :779525   NA's   :780030     NA's   :779525                     
  open_rv_12m      open_rv_24m       max_bal_bc       all_util         total_rev_hi_lim 
 Min.   : 0.0     Min.   : 0       Min.   :    0    Length:798641      Min.   :      0  
 1st Qu.: 0.0     1st Qu.: 1       1st Qu.: 2406    Class :character   1st Qu.:  13900  
 Median : 1.0     Median : 2       Median : 4502    Mode  :character   Median :  23700  
 Mean   : 1.4     Mean   : 3       Mean   : 5878                       Mean   :  32093  
 3rd Qu.: 2.0     3rd Qu.: 4       3rd Qu.: 7774                       3rd Qu.:  39800  
 Max.   :22.0     Max.   :43       Max.   :83047                       Max.   :9999999  
 NA's   :779525   NA's   :779525   NA's   :779525                      NA's   :63276    
     inq_fi        total_cu_tl      inq_last_12m   
 Min.   : 0.0     Min.   : 0.0     Min.   :-4      
 1st Qu.: 0.0     1st Qu.: 0.0     1st Qu.: 0      
 Median : 0.0     Median : 0.0     Median : 2      
 Mean   : 0.9     Mean   : 1.5     Mean   : 2      
 3rd Qu.: 1.0     3rd Qu.: 2.0     3rd Qu.: 3      
 Max.   :16.0     Max.   :35.0     Max.   :32      
 NA's   :779525   NA's   :779525   NA's   :779525  
my_data <- raw_data

Start cleaning up the columns

id

A unique LC assigned ID for the loan listing.

Is: integer Should be: integer and non-zero

filter(my_data, is.na(my_data$id)) # all are non-Null
filter(my_data, id == 0) # all are non-zero
filter(my_data, id < 0) # all are non-zero

Is not relevant to decision-making and should be dropped.

member_id

A unique LC assigned Id for the borrower member.

is: integer should be: integer, non-Null, non-zero

filter(my_data, is.na(my_data$member_id)) # all are non-Null
filter(my_data, member_id == 0) # all are non-zero
filter(my_data, member_id < 0) # all are positive

This is probably not relevant, but it could be the case that a member has obtained prior loans, and then that could be very relevant for the interest rate on this loan.

This never actually happens. So better to drop it.

loan_amt

The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

Is: integer Should be: integer, positive, non-zero, non-null

Data Exploration: 1. Histogram of the loan_amt in various binwidths

filter(my_data, is.na(my_data$loan_amnt)) # all are non-Null
filter(my_data, loan_amnt == 0) # all are non-zero
filter(my_data, loan_amnt < 0) # all are positive
summary(loan_amnt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    500    8000   13000   14754   20000   35000 
ggplot(data = my_data) +
  geom_histogram(aes(x = loan_amnt), binwidth = 100)

ggplot(data = my_data) +
  geom_histogram(aes(x = loan_amnt), binwidth = 250)

ggplot(data = my_data) +
  geom_histogram(aes(x = loan_amnt), binwidth = 500)

ggplot(data = my_data) +
  geom_histogram(aes(x = loan_amnt), binwidth = 1000)

funded_amnt

The total amount committed to that loan at that point in time.

is: integer should be: integer, non-null, non-negative

clear work done on this column

funded_amnt <-raw_data$funded_amnt

View the data


head(my_data$funded_amnt)
[1] 2600 6700 7000 3000 2525 5075
typeof(my_data$funded_amnt)
[1] "integer"
summary(my_data$funded_amnt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    500    8000   13000   14741   20000   35000 
filter(my_data, is.na(funded_amnt)) # no Null
filter(my_data, funded_amnt == 0) # no Zero
filter(my_data, funded_amnt < 0) # no negative

See where funded_amnt greater than, less than, equal to loan_amt

Visualize them together

ggplot(data = my_data) +
  geom_point(aes(x = loan_amnt, y = funded_amnt), alpha = 0.05)

funded_amnt_inv

The total amount committed by investors for that loan at that point in time. is: character should be: integer (or double?)

clear work done on this column

funded_amnt_inv <-raw_data$funded_amnt_inv

Turn them into integers

head(my_data$funded_amnt_inv)
[1] "575"  "6700" "3450" "125"  "225"  "350" 
my_data$funded_amnt_inv <- as.integer(my_data$funded_amnt_inv)
head(my_data$funded_amnt_inv)
[1]  575 6700 3450  125  225  350

See where funded_amnt_inv greater than, less than, equal to loan_amt

ggplot(data = my_data) +
  geom_point(aes(x = loan_amnt, y = funded_amnt_inv), alpha = 0.05)

This shows two different curves, indicating two different types of bowers receiving investor funding. What is the difference between the two types of borrowers?

term

The number of payments on the loan. Values are in months and can be either 36 or 60.

NOTE Gwen says this will not be available in new loan applications, but this makes no sense as there is a distinct difference in interest rates for 36 month versus 60 month loans.

is: character, either ” 36 months” or ” 60 months” should be: integer, either 36 or 60 (to represent number of months). But does it even matter if it’s essentially categorical?

This is similar to a categorical variable. Perhaps it explains the two curves observed in funded_amt_inv. (note, no, it does not appear to after analysis)

clear work done on this column

head(term)
[1] " 36 months" " 36 months" " 36 months" " 36 months" " 36 months" " 36 months"

Use this if keeping it as a string

my_data$term <- str_trim(my_data$term)
head(my_data$term)
[1] "36 months" "36 months" "36 months" "36 months" "36 months" "36 months"

visualizing the data

head(term)
[1] " 36 months" " 36 months" " 36 months" " 36 months" " 36 months" " 36 months"

There are about twice as many 36 month loans as 60 month.

Let’s plot the funded_amnt_inv filtering by 36 month and 60 month

This looks a lot like the other graph with no filtering.

This shows three curves. First, there is an upper curve where it begins linearly. Second, there is a discontinuity where the slope drastically changes towards 0 (and possibly appears to be upward curving?) Third, there is the lower curve, which appears more linear than those for 36 month terms, but still downward curving overall.

Visualize term versus loan amount

ggplot() + 
  geom_boxplot(mapping = aes(x = term, y = loan_amnt), data = filter(my_data, term == "60 months")) + 
  geom_boxplot(mapping = aes(x = term, y = loan_amnt), data = filter(my_data, term == "36 months"))

60 month loans are for more money than 36 month loans

ggplot() + 
  geom_boxplot(mapping = aes(x = term, y = funded_amnt), data = filter(my_data, term == "36 months")) + 
  geom_boxplot(mapping = aes(x = term, y = funded_amnt), data = filter(my_data, term == "60 months"))

The below provides a worthless graph. Nearly all loans fund regardless of term.

ggplot() + 
  geom_boxplot(mapping = aes(x = term, y = funded_amnt / loan_amnt), data = filter(my_data, term == "36 months")) + 
  geom_boxplot(mapping = aes(x = term, y = funded_amnt / loan_amnt), data = filter(my_data, term == "60 months"))

int_rate

Interest Rate on the loan

is: character should be: float

head(my_data$int_rate)
[1] "8.38"  "7.75"  "7.75"  "9.01"  "9.33"  "10.28"
my_data$int_rate <- as.double(my_data$int_rate)
head(my_data$int_rate)
[1]  8.38  7.75  7.75  9.01  9.33 10.28

ggplot(my_data) +
  geom_histogram(aes(x = int_rate), binwidth = .1)

ggplot(my_data) +
  geom_histogram(aes(x = int_rate), binwidth = .25)

ggplot(my_data) +
  geom_histogram(aes(x = int_rate), binwidth = .5)

ggplot(my_data) +
  geom_histogram(aes(x = int_rate), binwidth = 1)

Interest rate versus loan amount 60 month loans have a higher starting interest rate, and there are very few under 10,000 loan amount

ggplot(data = my_data, mapping = aes(x = loan_amnt, y = int_rate, color = term)) +
    geom_point(alpha = .05)

installment

The monthly payment owed by the borrower if the loan originates.

Note Gwen says this will not be available in the new data.

This should be directly correlated with the loan amount and interest rate and determined by term. It may be useful to use these values instead of those when say, comparing to income, in order to avoid calculation.

Is: string Should be: double

head(my_data$installment)
[1] "81.94"  "209.18" "218.55" "95.42"  "80.69"  "164.42"
my_data$installment <- as.double(my_data$installment)
head(my_data$installment)
[1]  81.94 209.18 218.55  95.42  80.69 164.42

emp_title

The job title supplied by the Borrower when applying for the loan. Employer Title replaces Employer Name for all loans listed after 9/23/2013

This is a string entered by the user and should be dropped.

my_data <- subset(my_data, select = -emp_title)

emp_length

Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.

Is: string Should be: Integer

head(my_data$emp_length)
[1] 3 0 0 0 0 0

Search NA

More than 40,000 rows have no employment information. That seems like a lot to drop. How should we handle them?

home_ownership

The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.

Is: String Should be: Perhaps also an ordered set of integers (“OTHER”, “RENT”, “MORTGAGE”, “OWN”). Could also make dummy columns, but this is my order if I were assessing borrowers. We should compare both models.

head(my_data$home_ownership)
[1] "MORTGAGE" "NONE"     "NONE"     "RENT"     "RENT"     "RENT"    
my_data$home_ownership <- as.integer(ordered(my_data$home_ownership, levels = c("OTHER", "RENT", "MORTGAGE", "OWN")))
head(my_data$home_ownership)
[1]  3 NA NA  2  2  2

Drop nulls. There are very few (47)

filter(my_data, is.na(my_data$home_ownership))
my_data <- filter(my_data, ! is.na(my_data$home_ownership))
filter(my_data, is.na(my_data$home_ownership))

annual_inc

The self-reported annual income provided by the borrower during registration.

Is: string should be: Int

Filter for null, zero, negative

filter(my_data, is.na(my_data$annual_inc)) # four items are non-Null, drop them.
filter(my_data, my_data$annual_inc == 0) # two are zero income, but significant income from joint applicant, so don't drop them.
filter(my_data, my_data$annual_inc < 10000) # 460 applicants.
filter(my_data, my_data$annual_inc < 500) # only the two 0 income applicants
filter(my_data, my_data$annual_inc < 0) # all are positive

Drop NA

my_data <- filter(my_data, ! is.na(annual_inc))
head(my_data$annual_inc)
[1]   6500  35000 110000  95000 150000  59000
ggplot(data = filter(my_data, annual_inc > 1000000), aes(x = annual_inc, y = int_rate)) +
         geom_point(aes(color = verification_status))

We should divide income over $1mil by 100 because presumably some of these people entered dollars and cents, but the system did not recognize that they were entering cents.

Needs a decision on the limit of income, where we will assume incorrect data entry. I assume $1,000,000 should have been $10,000.00, but it could be higher or lower.


my_data$annual_inc <- ifelse(my_data$annual_inc >= 1000000, my_data$annual_inc / 100, my_data$annual_inc)
filter(my_data, annual_inc > 1000000)


summary(my_data$annual_inc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0   45000   64999   74559   90000  999999 

verification_status

is_inc_v in the dictionary Indicates if income was verified by LC, not verified, or if the income source was verified

make it an integer, with not verified at 0 in order to show the gain from verified

levels(my_data$verification_status)
NULL

no NA

Does verification have a benefit? Yes

lm.fit <- lm(int_rate~verification_status, my_data)
summary(lm.fit)

Call:
lm(formula = int_rate ~ verification_status, data = my_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.2762 -3.2179 -0.2179  2.6738 17.1704 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         11.819567   0.007769  1521.4   <2e-16 ***
verification_status  1.388327   0.005991   231.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.242 on 798592 degrees of freedom
Multiple R-squared:  0.06302,   Adjusted R-squared:  0.06302 
F-statistic: 5.371e+04 on 1 and 798592 DF,  p-value: < 2.2e-16
ggplot(data = filter(my_data, verification_status == 0), aes(x = verification_status, y = int_rate)) +
  geom_boxplot()

ggplot(data = filter(my_data, verification_status == 1), aes(x = verification_status, y = int_rate)) +
  geom_boxplot()

ggplot(data = filter(my_data, verification_status == 2), aes(x = verification_status, y = int_rate)) +
  geom_boxplot()

This is surprising that people with verified income have a higher interest rate than people with an unverified income.

issue_d

The month which the loan was funded

Note Gwen says this will not be in new data.

Is: string Should be:

Is this useful? Probably would imagine that interest rates change.

Added the date as 01 in order to convert to a date.

head(my_data$issue_d)
[1] "Jun-2007" "Jun-2007" "Jun-2007" "Jul-2007" "Jul-2007" "Jul-2013"
f <- my_data$issue_d

for (n in 1:length(f)){
  f[n] <- paste("01-", f[n], sep = "")
}
my_data$issue_d <- as.Date(f, format = "%d-%b-%Y")
head(my_data$issue_d)
[1] "2007-06-01" "2007-06-01" "2007-06-01" "2007-07-01" "2007-07-01" "2013-07-01"

See interest rate over time. I need to make a mean interest rate for each month, or maybe filter by term, to make this more helpful.

ggplot(my_data, aes(x = issue_d, y = int_rate)) +
  geom_line()

loan_status

Current status of the loan

This is for after the loan is issued, so I don’t think it will be useful in predicting interest rate.

head(my_data$loan_status)
[1] "Does not meet the credit policy. Status:Charged Off"
[2] "Does not meet the credit policy. Status:Fully Paid" 
[3] "Does not meet the credit policy. Status:Fully Paid" 
[4] "Does not meet the credit policy. Status:Fully Paid" 
[5] "Does not meet the credit policy. Status:Fully Paid" 
[6] "Charged Off"                                        

This applies only to current loans and should be deleted

my_data <- subset(my_data, select = -loan_status)

pymnt_plan

Indicates if a payment plan has been put in place for the loan

This applies only to current, defaulted loans and should be deleted

my_data <- subset(my_data, select = -pymnt_plan)

url

URL for the LC page with listing data.

This information is not relevant to analysis and should be dropped.

my_data <- subset(my_data, select = -url)

desc

Loan description provided by the borrower

This information could be important but would be impossible to analyse since it is user-entered.

my_data <- subset(my_data, select = -desc)

purpose

A category provided by the borrower for the loan request.

Is: String Should be: Dummy columns.

The answer here is relevant for the interest rate.

levels(factor(my_data$purpose))
 [1] "car"                "credit_card"        "debt_consolidation" "educational"       
 [5] "home_improvement"   "house"              "major_purchase"     "medical"           
 [9] "moving"             "other"              "renewable_energy"   "small_business"    
[13] "vacation"           "wedding"           
ggplot(my_data, aes(x = purpose, y = int_rate)) +
  geom_boxplot()

Should I remove the first dummy to guard against multicolliniarity?

Need to see whether each category has a good number of observations.

title

The loan title provided by the borrower

This information could be important but would be impossible to analyse since it is user-entered.

my_data <- subset(my_data, select = -title)

zip_code

The first 3 numbers of the zip code provided by the borrower in the loan application.

This information would indicate where in the country the borrower is located. It might be a relevant variable, but it would at best be an approximation of a borrower’s income. I do not think it provides significant information.

my_data <- subset(my_data, select = -zip_code)

addr_state

The state provided by the borrower in the loan application

This information would indicate where in the country the borrower is located. It might be a relevant variable, but it would at best be an approximation of a borrower’s income. I do not think it provides significant information.

my_data <- subset(my_data, select = -addr_state)

dti

A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

Is: string Should be: double

filter(my_data, is.na(dti)) # no Null
filter(my_data, dti == 0) # 403 are zero, but this indicates no debt. Seems odd they turn to Lending Club for a loan if this is true.
filter(my_data, dti < 1)
filter(my_data, dti < 0) # no negative
head(my_data$dti)
[1] "6.46"  "10"    "10"    "10"    "5.6"   "24.55"
f <- as.double(my_data$dti)
head(f)
[1]  6.46 10.00 10.00 10.00  5.60 24.55

delinq_2yrs

The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years

Is: Integer Should be: Integer

typeof(my_data$delinq_2yrs)
[1] "integer"
summary(my_data$delinq_2yrs)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.3145  0.0000 39.0000      21 

There are 21 NAs. Drop them.

my_data <- filter(my_data, ! is.na(my_data$delinq_2yrs))
summary(my_data$delinq_2yrs)

earliest_cr_line

The month the borrower’s earliest reported credit line was opened

filter(raw_data, is.na(earliest_cr_line)) # no NA

This could be a placeholder for the borrower’s age. Is it correlated with interest rate? I would imagine that if this date is old, the person is old, so has a higher interest rate, and if young, low income, so higher interest rate. But this would simply be because of my expectations of income by age, and this data does not directly tell us anything.

Added the date as 01 in order to convert to a date.

head(my_data$earliest_cr_line)
[1] "Mar-1984" "Dec-1994" "Jul-1993" "Jan-1983" "Oct-1996" "Dec-1997"
f <- my_data$earliest_cr_line

for (n in 1:length(f)){
  f[n] <- paste("01-", f[n], sep = "")
}
my_data$earliest_cr_line <- as.Date(f, format = "%d-%b-%Y")
head(my_data$earliest_cr_line)
[1] "1984-03-01" "1994-12-01" "1993-07-01" "1983-01-01" "1996-10-01" "1997-12-01"

Drop the data as it is not really useful.

my_data <- subset(my_data, select = -earliest_cr_line)

inq_last_6mths

The number of inquiries in past 6 months (excluding auto and mortgage inquiries)

This represents how many times a lender has pulled the borrower’s credit report. In itself it does not mean anything, but a higher value is taken as an indication that the borrower is not creditworthy (else, why ask for so much debt?) 0-2 is fine.

There are no NA values

Plot the range

ggplot(data = my_data, aes(x = inq_last_6mths)) +
  geom_histogram(binwidth = 1)

Power law distribution.

summary(lm.1)

Call:
lm(formula = int_rate ~ inq_last_6mths, data = my_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-31.3484  -3.3794  -0.2518   2.7606  16.4406 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    12.549391   0.005818  2157.0   <2e-16 ***
inq_last_6mths  1.001219   0.004785   209.2   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.267 on 798571 degrees of freedom
Multiple R-squared:  0.05197,   Adjusted R-squared:  0.05197 
F-statistic: 4.378e+04 on 1 and 798571 DF,  p-value: < 2.2e-16

This has a significant impact on interest rate.

mths_since_last_delinq

The number of months since the borrower’s last delinquency.

This is the time that has passed since a buyer was delinquent. Approximately half of the entries are NA. HOWEVER if the borrower has never been delinquent (that is, always pays on time), it SHOULD be NA. NA is a better value than having anything here. BUT the data shows this is not actually true, and those with NA have a higher interest rate.

Possible solutions: very, very high value to replace NA?

ggplot(data = my_data, aes(x = mths_since_last_delinq, y = int_rate)) +
  geom_point(alpha = 0.05)

There is a hard cutoff at 84 months because those delinquencies are no longer reported on a credit report (7 years). There are still some observations beyond that time period, but I am not sure why. They may pre-date the change in the law (I recall this was sometime around 2009-2010) or loopholes that allow continuing reporting, or self-reporting to LendingClub.

We need to figure out a way to deal with the <84, >=84 problem.

count(my_data, mths_since_last_delinq == 83)
count(my_data, mths_since_last_delinq == 84)
count(my_data, mths_since_last_delinq == 85)
NA

Those with an NA have a higher interest rate, which I would not expect.

ggplot(data = my_data) +
  geom_boxplot(mapping = aes(x = is.na(mths_since_last_delinq), y = int_rate))

mths_since_last_record

The number of months since the last public record.

This would be the months since the last public record filing, which means a lawsuit to collect a debt. From my old credit report from 2015: “Public record information includes bankruptcies, liens or judgments and comes from federal, state or county court records.” Today, they include only bankruptcies, but this is not relevant.

This is very, very bad, and much better if it is NA (i.e. there are none)

levels(factor(my_data$mths_since_last_record))
  [1] "0"   "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"  "13"  "14" 
 [16] "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24"  "25"  "26"  "27"  "28"  "29" 
 [31] "30"  "31"  "32"  "33"  "34"  "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44" 
 [46] "45"  "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59" 
 [61] "60"  "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72"  "73"  "74" 
 [76] "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"  "85"  "86"  "87"  "88"  "89" 
 [91] "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"  "99"  "100" "101" "102" "103" "104"
[106] "105" "106" "107" "108" "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119"
[121] "120" "121" "129"
count(my_data, is.na(mths_since_last_record))
ggplot(data = my_data, aes(x = mths_since_last_record, y = int_rate)) +
  geom_point(alpha = 0.05)

I do not see a pattern appear.

They also remain on your credit report for ten years during this time period (this is no longer true, but not relevant)

count(my_data, mths_since_last_record == 119)
count(my_data, mths_since_last_record == 120)
NA

Thos with no public records have higher interest rates than those who do, which I would not expect.

ggplot(data = my_data) +
  geom_boxplot(mapping = aes(x = is.na(mths_since_last_record), y = int_rate))

open_acc

The number of open credit lines in the borrower’s credit file.

Is: Integer Should be: Integer

I would expect this to be generally on the high side for users of this platform.

typeof(my_data$open_acc)
[1] "integer"
levels(factor(my_data$open_acc))
 [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17"
[19] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35"
[37] "36" "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53"
[55] "54" "55" "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "75" "79" "82"
[73] "84" "90"
count(my_data, is.na(open_acc)) # no NA
ggplot(data = my_data, aes(x = open_acc)) +
  geom_histogram()


summary(my_data$open_acc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    8.00   11.00   11.55   14.00   90.00 

Is this associated with int_rate?

lm.2 <- lm(int_rate~open_acc, my_data)
summary(lm.2)

Call:
lm(formula = int_rate ~ open_acc, data = my_data)

Residuals:
   Min     1Q Median     3Q    Max 
-8.018 -3.277 -0.251  2.941 15.995 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 13.3463730  0.0117312 1137.684   <2e-16 ***
open_acc    -0.0087796  0.0009227   -9.515   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.382 on 798571 degrees of freedom
Multiple R-squared:  0.0001134, Adjusted R-squared:  0.0001121 
F-statistic: 90.53 on 1 and 798571 DF,  p-value: < 2.2e-16

This says the interest rate goes down for people with more open accounts and that this is significant, which seems odd. The relationship is probably non-linear, if significant.

pub_rec

Number of derogatory public records

Is: Integer Should be: Integer

Histogram shows very, very strongly power law.

ggplot(data = my_data, aes(x = pub_rec)) +
  geom_histogram()

summary(my_data$pub_rec)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.1953  0.0000 63.0000 

revol_balance

Total credit revolving balance

This is the revolving debt (credit card, usually) of the borrower. It will be a part of the dti calculation.

Two are na. One has a revol_util figure, so I expect this is a data error. The other has a revol_util of 0, so this may be showing a revol_bal == 0. Since it is only two observations, we should drop them.

revol_util

Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.

This is a ratio showing how maxed out the borrower’s credit cards are. We would expect creditworthy borrowers to have a low ratio.

Is: string Should be: double

ggplot(data = my_data, aes(x = revol_util, y = int_rate)) +
  geom_point(alpha = 0.05)

Has a fairly normal distribution.

total_acc

The total number of credit lines currently in the borrower’s credit file

I do not expect this to be useful to the analysis. It does not distinguish between open or closed accounts. In general not a good predictor.

Is: integer Should be: integer

head(my_data$total_acc)
[1] 16 19 43 28 20 35

initial_list_status

The initial listing status of the loan. Possible values are – W, F

“Lending Club reserves a few loans for 12 hours and offers them to the institutional and large retail lenders who want to lend the whole amount for a loan. I am not sure whether the historical loan data file includes the loans that were offered and picked up by lenders as ‘whole’ loans. But, the loans that were initially offered as whole, designated with ‘w’, but not picked up as ‘whole’ loans are listed in historical loan data file.” https://andirog.blogspot.com/2013/04/lending-club-borrowers-income.html

This may be useful because those loans taken by institutional and large lenders may be the most preferred loans by lenders.

head(my_data$initial_list_status)
[1] "f" "f" "f" "f" "f" "f"

Is there a difference?

w loans tend to have a lower interest rate.

Make it an integer, with f == 0 and w == 1, but this could instead be dummy columns

head(my_data$initial_list_status)
[1] "f" "f" "f" "f" "f" "f"
my_data$initial_list_status <- as.integer(factor(my_data$initial_list_status)) - 1
head(my_data$initial_list_status)
[1] 0 0 0 0 0 0

Is this useful?

lm.3 <- lm(int_rate~initial_list_status, my_data)
summary(lm.3)

Call:
lm(formula = int_rate ~ initial_list_status, data = my_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.4143 -3.5559 -0.2359  2.8557 16.2641 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         13.734260   0.006789  2022.9   <2e-16 ***
initial_list_status -1.008355   0.009747  -103.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.353 on 798569 degrees of freedom
Multiple R-squared:  0.01323,   Adjusted R-squared:  0.01322 
F-statistic: 1.07e+04 on 1 and 798569 DF,  p-value: < 2.2e-16

out_prncp

Remaining outstanding principal for total amount funded

This is only relevant after a loan is issued and should be dropped.

my_data <- subset(my_data, select = -out_prncp)
Error in eval(substitute(select), nl, parent.frame()) : 
  object 'out_prncp' not found

out_prncp_inv

Remaining outstanding principal for portion of total amount funded by investors

This is only relevant after a loan is issued and should be dropped.

my_data <- subset(my_data, select = -out_prncp_inv)

total_pymnt

Payments received to date for total amount funded

This is only relevant after a loan is issued and should be dropped.

my_data <- subset(my_data, select = -total_pymnt)

total_pymnt_inv

Payments received to date for portion of total amount funded by investors

This is only relevant after a loan is issued and should be dropped.

my_data <- subset(my_data, select = -total_pymnt_inv)


head(my_data$dti_joint)
f <- as.double(my_data$dti_joint)
summary(f)
---
title: "R Notebook"
output: html_notebook
---
Gwen claims certain attributes are not relevant. See the spreadsheet she attached on Moodle.
These attributes WILL NOT be filled in.

# Data Pre-Processing

```{r}
library(corrplot)
library(ggplot2)
library(tidyverse)
library(fastDummies)
library(GGally)
```

Get rid of old stuff
```{r}
rm(list=ls())
par(mfrow=c(1,1))

```

```{r}
raw_data <- read.csv2("./LCdata.csv", header = TRUE, row.names=NULL, sep=";")


head(raw_data)
summary(raw_data)

my_data <- raw_data
```

Start cleaning up the columns

### id
A unique LC assigned ID for the loan listing.

Is: integer
Should be: integer and non-zero
```{r}
filter(my_data, is.na(my_data$id)) # all are non-Null
filter(my_data, id == 0) # all are non-zero
filter(my_data, id < 0) # all are positive
```

Is not relevant to decision-making and should be dropped.
```{r}
my_data <- subset(my_data, select = -id)

```

### member_id
A unique LC assigned Id for the borrower member.

is: integer
should be: integer, non-Null, non-zero
```{r}
filter(my_data, is.na(my_data$member_id)) # all are non-Null
filter(my_data, member_id == 0) # all are non-zero
filter(my_data, member_id < 0) # all are positive
```

This is probably not relevant, but it could be the case that a member has obtained prior loans, and then that could be very relevant for the interest rate on this loan.

```{r}
f <- filter(my_data, duplicated(my_data$member_id) == TRUE)
f
```
This never actually happens. So better to drop it.

```{r}
my_data <- subset(my_data, select = -member_id)

```


### loan_amt
The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

Is: integer
Should be: integer, positive, non-zero, non-null

Data Exploration:
1.  Histogram of the loan_amt in various binwidths
```{r}
filter(my_data, is.na(my_data$loan_amnt)) # all are non-Null
filter(my_data, my_data$loan_amnt == 0) # all are non-zero
filter(my_data, my_data$loan_amnt < 0) # all are positive
summary(my_data$loan_amnt)
ggplot(data = my_data) +
  geom_histogram(aes(x = loan_amnt), binwidth = 100)
ggplot(data = my_data) +
  geom_histogram(aes(x = loan_amnt), binwidth = 250)
ggplot(data = my_data) +
  geom_histogram(aes(x = loan_amnt), binwidth = 500)
ggplot(data = my_data) +
  geom_histogram(aes(x = loan_amnt), binwidth = 1000)
```

### funded_amnt
The total amount committed to that loan at that point in time.

is: integer
should be: integer, non-null, non-negative

clear work done on this column

```{r}
funded_amnt <-raw_data$funded_amnt
```

View the data
```{r}

head(my_data$funded_amnt)
typeof(my_data$funded_amnt)
summary(my_data$funded_amnt)
filter(my_data, is.na(funded_amnt)) # no Null
filter(my_data, funded_amnt == 0) # no Zero
filter(my_data, funded_amnt < 0) # no negative
```

See where funded_amnt greater than, less than, equal to loan_amt

Visualize them together
```{r}
ggplot(data = my_data) +
  geom_point(aes(x = loan_amnt, y = funded_amnt), alpha = 0.05)

```


### funded_amnt_inv
The total amount committed by investors for that loan at that point in time.
is: character
should be: integer (or double?)

clear work done on this column
```{r}
funded_amnt_inv <-raw_data$funded_amnt_inv


```

Turn them into integers
```{r}
head(my_data$funded_amnt_inv)
my_data$funded_amnt_inv <- as.integer(my_data$funded_amnt_inv)
head(my_data$funded_amnt_inv)
```

See where funded_amnt_inv greater than, less than, equal to loan_amt
```{r}
ggplot(data = my_data) +
  geom_point(aes(x = loan_amnt, y = funded_amnt_inv), alpha = 0.05)
```
This shows two different curves, indicating two different types of bowers receiving investor funding. What is the difference between the two types of borrowers?

### term
The number of payments on the loan. Values are in months and can be either 36 or 60.


*NOTE* Gwen says this will not be available in new loan applications, but this makes no sense as there is a distinct difference in interest rates for 36 month versus 60 month loans.

is: character, either " 36 months" or " 60 months"
should be: integer, either 36 or 60 (to represent number of months). But does it even matter if it's essentially categorical?

This is similar to a categorical variable. Perhaps it explains the two curves observed in funded_amt_inv. (note, no, it does not appear to after analysis)

clear work done on this column
```{r}
term <-raw_data$term

```


Use this if keeping it as a string
```{r}
my_data$term <- str_trim(my_data$term)
head(my_data$term)
```

visualizing the data
```{r}
ggplot(data = my_data, aes(x = term)) + 
  geom_bar()
```

There are about twice as many 36 month loans as 60 month.

Let's plot the funded_amnt_inv filtering by 36 month and 60 month
```{r}
ggplot(data = filter(my_data, term == "36 months"), mapping = aes(x = loan_amnt, y = funded_amnt_inv)) +
    geom_point(alpha = .05)
```

This looks a lot like the other graph with no filtering.
```{r}
ggplot(data = filter(my_data, term == "60 months"), mapping = aes(x = loan_amnt, y = funded_amnt_inv)) +
    geom_point(alpha = .05)

```
This shows three curves. 
First, there is an upper curve where it begins linearly.
Second, there is a discontinuity where the slope drastically changes towards 0 (and possibly appears to be upward curving?)
Third, there is the lower curve, which appears more linear than those for 36 month terms, but still downward curving overall.


Visualize term versus loan amount
```{r}
ggplot() + 
  geom_boxplot(mapping = aes(x = term, y = loan_amnt), data = filter(my_data, term == "60 months")) + 
  geom_boxplot(mapping = aes(x = term, y = loan_amnt), data = filter(my_data, term == "36 months"))
```
60 month loans are for more money than 36 month loans

```{r}
ggplot() + 
  geom_boxplot(mapping = aes(x = term, y = funded_amnt), data = filter(my_data, term == "36 months")) + 
  geom_boxplot(mapping = aes(x = term, y = funded_amnt), data = filter(my_data, term == "60 months"))
```

The below provides a worthless graph. Nearly all loans fund regardless of term.
```{r}
ggplot() + 
  geom_boxplot(mapping = aes(x = term, y = funded_amnt / loan_amnt), data = filter(my_data, term == "36 months")) + 
  geom_boxplot(mapping = aes(x = term, y = funded_amnt / loan_amnt), data = filter(my_data, term == "60 months"))
```

### int_rate
Interest Rate on the loan

is: character
should be: float

```{r}
head(my_data$int_rate)
my_data$int_rate <- as.double(my_data$int_rate)
head(my_data$int_rate)


```
```{r}

ggplot(my_data) +
  geom_histogram(aes(x = int_rate), binwidth = .1)
ggplot(my_data) +
  geom_histogram(aes(x = int_rate), binwidth = .25)
ggplot(my_data) +
  geom_histogram(aes(x = int_rate), binwidth = .5)
ggplot(my_data) +
  geom_histogram(aes(x = int_rate), binwidth = 1)
```

Interest rate versus loan amount
60 month loans have a higher starting interest rate, and there are very few under 10,000 loan amount
```{r}
ggplot(data = my_data, mapping = aes(x = loan_amnt, y = int_rate, color = term)) +
    geom_point(alpha = .05)
```

### installment

The monthly payment owed by the borrower if the loan originates.

*Note* Gwen says this will not be available in the new data.
```{r}
my_data <- subset(my_data, select = -installment)
```


This should be directly correlated with the loan amount and interest rate and determined by term. It may be useful to use these values instead of those when say, comparing to income, in order to avoid calculation.

Is: string
Should be: double

```{r}
head(my_data$installment)
my_data$installment <- as.double(my_data$installment)
head(my_data$installment)
```

### emp_title

The job title supplied by the Borrower when applying for the loan.*
* Employer Title replaces Employer Name for all loans listed after 9/23/2013

This is a string entered by the user and should be dropped.
```{r}
my_data <- subset(my_data, select = -emp_title)

```

### emp_length

Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years. 

Is: string
Should be: Integer
```{r}
head(my_data$emp_length)

my_data$emp_length <- as.integer(ordered(my_data$emp_length, levels = c("< 1 year", "1 year", "2 years", "3 years", "4 years", "5 years", "6 years", "7 years", "8 years", "9 years", "10+ years"))) - 1

head(my_data$emp_length)
```

Search NA
```{r}

filter(my_data, is.na(my_data$emp_length))
```
More than 40,000 rows have no employment information. That seems like a lot to drop. How should we handle them?

## home_ownership

The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.

Is: String
Should be: Perhaps also an ordered set of integers ("OTHER", "RENT", "MORTGAGE", "OWN"). Could also make dummy columns, but this is my order if I were assessing borrowers. We should compare both models.
```{r}
head(my_data$home_ownership)

my_data$home_ownership <- as.integer(ordered(my_data$home_ownership, levels = c("OTHER", "RENT", "MORTGAGE", "OWN")))
head(my_data$home_ownership)


```

Drop nulls. There are very few (47)
```{r}
filter(my_data, is.na(my_data$home_ownership))
my_data <- filter(my_data, ! is.na(my_data$home_ownership))
filter(my_data, is.na(my_data$home_ownership))
```

### annual_inc
The self-reported annual income provided by the borrower during registration.

Is: string
should be: Int

Filter for null, zero, negative
```{r}
filter(my_data, is.na(my_data$annual_inc)) # four items are non-Null, drop them.
filter(my_data, my_data$annual_inc == 0) # two are zero income, but significant income from joint applicant, so don't drop them.
filter(my_data, my_data$annual_inc < 10000) # 460 applicants.
filter(my_data, my_data$annual_inc < 500) # only the two 0 income applicants
filter(my_data, my_data$annual_inc < 0) # all are positive
```

Drop NA
```{r}
my_data <- filter(my_data, ! is.na(annual_inc))

```

```{r}

head(my_data$annual_inc)
my_data$annual_inc <- as.integer(my_data$annual_inc)
head(my_data$annual_inc)
```
```{r}
count(filter(my_data, verification_status == 0))
```

```{r}
ggplot(data = filter(my_data, annual_inc > 1000000), aes(x = annual_inc, y = int_rate)) +
         geom_point(aes(color = verification_status))
```
We should divide income over $1mil by 100 because presumably some of these people entered dollars and cents, but the system did not recognize that they were entering cents.
```{r}
filter(my_data, annual_inc > 1000000)
```

Needs a decision on the limit of income, where we will assume incorrect data entry. I assume $1,000,000 should have been $10,000.00, but it could be higher or lower.
```{r}

my_data$annual_inc <- ifelse(my_data$annual_inc >= 1000000, my_data$annual_inc / 100, my_data$annual_inc)
filter(my_data, annual_inc > 1000000)


summary(my_data$annual_inc)
```


### verification_status

is_inc_v in the dictionary
Indicates if income was verified by LC, not verified, or if the income source was verified

make it an integer, with not verified at 0 in order to show the gain from verified

```{r}
head(my_data$verification_status)
my_data$verification_status <- as.integer(factor(my_data$verification_status)) - 1
head(my_data$verification_status)

```

no NA
```{r}
filter(my_data, is.na(my_data$verification_status))
```

Does verification have a benefit? Yes
```{r}
lm.fit <- lm(int_rate~verification_status, my_data)
summary(lm.fit)
```

```{r}
ggplot(data = filter(my_data, verification_status == 0), aes(x = verification_status, y = int_rate)) +
  geom_boxplot()
ggplot(data = filter(my_data, verification_status == 1), aes(x = verification_status, y = int_rate)) +
  geom_boxplot()
ggplot(data = filter(my_data, verification_status == 2), aes(x = verification_status, y = int_rate)) +
  geom_boxplot()
```
This is surprising that people with verified income have a higher interest rate than people with an unverified income.

### issue_d
The month which the loan was funded

*Note* Gwen says this will not be in new data.
```{r}
my_data <- subset(my_data, select = -issue_d)
```

Is: string
Should be: 

Is this useful? Probably would imagine that interest rates change.

Added the date as 01 in order to convert to a date.
```{r}
head(my_data$issue_d)
f <- my_data$issue_d

for (n in 1:length(f)){
  f[n] <- paste("01-", f[n], sep = "")
}
my_data$issue_d <- as.Date(f, format = "%d-%b-%Y")
head(my_data$issue_d)


```

See interest rate over time. I need to make a mean interest rate for each month, or maybe filter by term, to make this more helpful.
```{r}
ggplot(my_data, aes(x = issue_d, y = int_rate)) +
  geom_line()
```

### loan_status
Current status of the loan

This is for after the loan is issued, so I don't think it will be useful in predicting interest rate.

```{r}
levels(factor(my_data$loan_status))
```

This applies only to current loans and should be deleted
```{r}
my_data <- subset(my_data, select = -loan_status)

```

### pymnt_plan

Indicates if a payment plan has been put in place for the loan

This applies only to current, defaulted loans and should be deleted
```{r}
my_data <- subset(my_data, select = -pymnt_plan)
```

### url

URL for the LC page with listing data.

This information is not relevant to analysis and should be dropped.
```{r}
my_data <- subset(my_data, select = -url)

```

### desc

Loan description provided by the borrower

This information could be important but would be impossible to analyse since it is user-entered.
```{r}
my_data <- subset(my_data, select = -desc)

```

### purpose

A category provided by the borrower for the loan request. 

Is: String
Should be: Dummy columns.

```{r}
filter(my_data, is.na(my_data$purpose)) # all are non-Null
```

The answer here is relevant for the interest rate.
```{r}
levels(factor(my_data$purpose))
ggplot(my_data, aes(x = purpose, y = int_rate)) +
  geom_boxplot()

```
Should I remove the first dummy to guard against multicolliniarity?
```{r}
my_data <- dummy_columns(my_data, select_columns = "purpose", remove_selected_columns = TRUE)

```

Need to see whether each category has a good number of observations.

### title

The loan title provided by the borrower


This information could be important but would be impossible to analyse since it is user-entered.
```{r}
my_data <- subset(my_data, select = -title)

```

### zip_code

The first 3 numbers of the zip code provided by the borrower in the loan application.

This information would indicate where in the country the borrower is located. It might be a relevant variable, but it would at best be an approximation of a borrower's income. I do not think it provides significant information.

```{r}
my_data <- subset(my_data, select = -zip_code)

```


### addr_state

The state provided by the borrower in the loan application


This information would indicate where in the country the borrower is located. It might be a relevant variable, but it would at best be an approximation of a borrower's income. I do not think it provides significant information.

```{r}
my_data <- subset(my_data, select = -addr_state)

```

### dti

A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

Is: string
Should be: double

```{r}
filter(my_data, is.na(dti)) # no Null
filter(my_data, dti == 0) # 403 are zero, but this indicates no debt. Seems odd they turn to Lending Club for a loan if this is true.
filter(my_data, dti < 1) # 3266 are below this ratio
filter(my_data, dti < 0) # no negative
```

```{r}
head(my_data$dti)
f <- as.double(my_data$dti)
head(f)
```


### delinq_2yrs

The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years

Is: Integer
Should be: Integer
```{r}
typeof(my_data$delinq_2yrs)
summary(my_data$delinq_2yrs)
```

There are 21 NAs. Drop them.
```{r}
my_data <- filter(my_data, ! is.na(my_data$delinq_2yrs))
summary(my_data$delinq_2yrs)

```


### earliest_cr_line

The month the borrower's earliest reported credit line was opened

```{r}
filter(my_data, is.na(earliest_cr_line)) # no NA
```


This could be a placeholder for the borrower's age. Is it correlated with interest rate? I would imagine that if this date is old, the person is old, so has a higher interest rate, and if young, low income, so higher interest rate. But this would simply be because of my expectations of income by age, and this data does not directly tell us anything.

Added the date as 01 in order to convert to a date.
```{r}
head(my_data$earliest_cr_line)
f <- my_data$earliest_cr_line

for (n in 1:length(f)){
  f[n] <- paste("01-", f[n], sep = "")
}
my_data$earliest_cr_line <- as.Date(f, format = "%d-%b-%Y")
head(my_data$earliest_cr_line)


```

Drop the data as it is not really useful.
```{r}
my_data <- subset(my_data, select = -earliest_cr_line)

```

### inq_last_6mths

The number of inquiries in past 6 months (excluding auto and mortgage inquiries)

This represents how many times a lender has pulled the borrower's credit report. In itself it does not mean anything, but a higher value is taken as an indication that the borrower is not creditworthy (else, why ask for so much debt?) 0-2 is fine.

There are no NA values

```{r}
head(my_data$inq_last_6mths)
count(my_data, is.na(inq_last_6mths))
```
Plot the range
```{r}
ggplot(data = my_data, aes(x = inq_last_6mths)) +
  geom_histogram(binwidth = 1)
```
Power law distribution.

```{r}
lm.1 <- lm(int_rate~inq_last_6mths, my_data)
summary(lm.1)
```
This has a significant impact on interest rate.

### mths_since_last_delinq

The number of months since the borrower's last delinquency.

This is the time that has passed since a buyer was delinquent. Approximately half of the entries are NA. *HOWEVER* if the borrower has never been delinquent (that is, always pays on time), it *SHOULD* be NA. NA is a better value than having anything here. *BUT* the data shows this is not actually true, and those with NA have a higher interest rate.

Possible solutions: very, very high value to replace NA?

```{r}
levels(factor(my_data$mths_since_last_delinq))
count(my_data, is.na(mths_since_last_delinq))
```
```{r}
ggplot(data = my_data, aes(x = mths_since_last_delinq, y = int_rate)) +
  geom_point(alpha = 0.05)
```
There is a hard cutoff at 84 months because those delinquencies are no longer reported on a credit report (7 years). There are still some observations beyond that time period, but I am not sure why. They may pre-date the change in the law (I recall this was sometime around 2009-2010) or loopholes that allow continuing reporting, or self-reporting to LendingClub.

We need to figure out a way to deal with the <84, >=84 problem.

```{r}
count(my_data, mths_since_last_delinq == 83)
count(my_data, mths_since_last_delinq == 84)


```

Those with an NA have a higher interest rate, which I would not expect.
```{r}
ggplot(data = my_data) +
  geom_boxplot(mapping = aes(x = is.na(mths_since_last_delinq), y = int_rate))
```


### mths_since_last_record

The number of months since the last public record.

This would be the months since the last public record filing, which means a lawsuit to collect a debt.
From my old credit report from 2015: "Public record information includes bankruptcies, liens or judgments and comes from federal, state or county court records." Today, they include only bankruptcies, but this is not relevant.

This is very, very bad, and much better if it is NA (i.e. there are none)

```{r}
levels(factor(my_data$mths_since_last_record))
count(my_data, is.na(mths_since_last_record))
```

```{r}
ggplot(data = my_data, aes(x = mths_since_last_record, y = int_rate)) +
  geom_point(alpha = 0.05)
```
I do not see a pattern appear.

They also remain on your credit report for ten years during this time period (this is no longer true, but not relevant)


```{r}
count(my_data, mths_since_last_record == 119)
count(my_data, mths_since_last_record == 120)

```

Thos with no public records have higher interest rates than those who do, which I would not expect.

```{r}
ggplot(data = my_data) +
  geom_boxplot(mapping = aes(x = is.na(mths_since_last_record), y = int_rate))
```

### open_acc

The number of open credit lines in the borrower's credit file.

Is: Integer
Should be: Integer

I would expect this to be generally on the high side for users of this platform.


```{r}
typeof(my_data$open_acc)
levels(factor(my_data$open_acc))
count(my_data, is.na(open_acc)) # no NA
```

```{r}
ggplot(data = my_data, aes(x = open_acc)) +
  geom_histogram()

summary(my_data$open_acc)
```
Is this associated with int_rate?
```{r}
lm.2 <- lm(int_rate~open_acc, my_data)
summary(lm.2)
```
This says the interest rate goes down for people with more open accounts and that this is significant, which seems odd. The relationship is probably non-linear, if significant.

### pub_rec

Number of derogatory public records

Is: Integer
Should be: Integer

```{r}
typeof(my_data$pub_rec)
count(my_data, is.na(pub_rec)) # No NA
```
Histogram shows very, very strongly power law.
```{r}
ggplot(data = my_data, aes(x = pub_rec)) +
  geom_histogram()
summary(my_data$pub_rec)
```

### revol_balance

Total credit revolving balance

This is the revolving debt (credit card, usually) of the borrower. It will be a part of the dti calculation.

```{r}
head(my_data$revol_bal)
count(my_data, is.na(revol_bal))
filter(my_data, is.na(revol_bal))
count(my_data, revol_bal == 0)
```

Two are na. One has a revol_util figure, so I expect this is a data error. The other has a revol_util of 0, so this may be showing a revol_bal == 0. Since it is only two observations, we should drop them.

```{r}
filter(my_data, is.na(my_data$revol_bal))
my_data <- filter(my_data, ! is.na(my_data$revol_bal))
filter(my_data, is.na(my_data$revol_bal))
```

### revol_util

Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.

This is a ratio showing how maxed out the borrower's credit cards are. We would expect creditworthy borrowers to have a low ratio.

Is: string
Should be: double

```{r}
typeof(my_data$revol_util)
count(my_data, is.na(revol_util)) # 429 are NA. Do we drop them? If we use this variable, then probably. But is this variable predictive enough to use? Likely not.
my_data$revol_util <- as.double(my_data$revol_util)
head(my_data$revol_util)

```

```{r}
ggplot(data = my_data, aes(x = revol_util, y = int_rate)) +
  geom_point(alpha = 0.05)
```

Has a fairly normal distribution.
```{r}
ggplot(data = my_data, aes(x = revol_util)) +
  geom_histogram(binwidth = 1)
```

### total_acc

The total number of credit lines currently in the borrower's credit file

I do not expect this to be useful to the analysis. It does not distinguish between open or closed accounts. In general not a good predictor.

Is: integer
Should be: integer

```{r}
head(my_data$total_acc)
```
### initial_list_status

The initial listing status of the loan. Possible values are – W, F

"Lending Club reserves a few loans for 12 hours and offers them to the institutional and large retail lenders who want to lend the whole amount for a loan. I am not sure whether the historical loan data file includes the loans that were offered and picked up by lenders as 'whole' loans. But, the loans that were initially offered as whole, designated with 'w', but not picked up as 'whole' loans are listed in historical loan data file."
https://andirog.blogspot.com/2013/04/lending-club-borrowers-income.html

This may be useful because those loans taken by institutional and large lenders may be the most preferred loans by lenders.
```{r}
head(my_data$initial_list_status)
```
Is there a difference?

```{r}
ggplot(data = my_data, aes(x = initial_list_status, y = int_rate)) +
  geom_boxplot()
```
w loans tend to have a lower interest rate.

Make it an integer, with f == 0 and w == 1, but this could instead be dummy columns

```{r}
head(my_data$initial_list_status)
my_data$initial_list_status <- as.integer(factor(my_data$initial_list_status)) - 1
head(my_data$initial_list_status)

```
Is this useful?
```{r}
lm.3 <- lm(int_rate~initial_list_status, my_data)
summary(lm.3) # R^2 of 0.01, so not a huge contributor, but the values show it is significant.
```
### out_prncp

Remaining outstanding principal for total amount funded

This is only relevant after a loan is issued and should be dropped.
```{r}
my_data <- subset(my_data, select = -out_prncp)

```

### out_prncp_inv

Remaining outstanding principal for portion of total amount funded by investors

This is only relevant after a loan is issued and should be dropped.
```{r}
my_data <- subset(my_data, select = -out_prncp_inv)

```

### total_pymnt

Payments received to date for total amount funded


This is only relevant after a loan is issued and should be dropped.
```{r}
my_data <- subset(my_data, select = -total_pymnt)

```

### total_pymnt_inv

Payments received to date for portion of total amount funded by investors


This is only relevant after a loan is issued and should be dropped.
```{r}
my_data <- subset(my_data, select = -total_pymnt_inv)

```


```{r}


head(my_data$dti_joint)
f <- as.double(my_data$dti_joint)
summary(f)
```

